UDPipe: Trainable Pipeline for Processing CoNLL-U Files Performing Tokenization, Morphological Analysis, POS Tagging and Parsing

نویسندگان

Milan Straka

Jan Hajic

Jana Straková

چکیده

Automatic natural language processing of large texts often presents recurring challenges in multiple languages: even for most advanced tasks, the texts are first processed by basic processing steps – from tokenization to parsing. We present an extremely simple-to-use tool consisting of one binary and one model (per language), which performs these tasks for multiple languages without the need for any other external data. UDPipe, a pipeline processing CoNLL-U-formatted files, performs tokenization, morphological analysis, part-of-speech tagging, lemmatization and dependency parsing for nearly all treebanks of Universal Dependencies 1.2 (namely, the whole pipeline is currently available for 32 out of 37 treebanks). In addition, the pipeline is easily trainable with training data in CoNLL-U format (and in some cases also with additional raw corpora) and requires minimal linguistic knowledge on the users’ part. The training code is also released.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe

We present and update to UDPipe 1.0 (Straka et al., 2016), a trainable pipeline which performs sentence segmentation, tokenization, POS tagging, lemmatization and dependency parsing. We provide models for all 50 languages of UD 2.0, and furthermore, the pipeline can be trained easily using data in CoNLL-U format. For the purpose of theCoNLL 2017 Shared Task: Multilingual Parsing from Raw Text t...

متن کامل

Prague at EPE 2017: The UDPipe System

We present our contribution to The First Shared Task on Extrinsic Parser Evaluation (EPE 2017). Our participant system, the UDPipe, is an open-source pipeline performing tokenization, morphological analysis, part-of-speech tagging, lemmatization and dependency parsing. It is trained in a language agnostic manner for 50 languages of the UD version 2. With a relatively limited amount of training ...

متن کامل

The HIT-SCIR System for End-to-End Parsing of Universal Dependencies

This paper describes our system (HITSCIR) for the CoNLL 2017 shared task: Multilingual Parsing from Raw Text to Universal Dependencies. Our system includes three pipelined components: tokenization, Part-of-Speech (POS) tagging and dependency parsing. We use character-based bidirectional long shortterm memory (LSTM) networks for both tokenization and POS tagging. Afterwards, we employ a list-bas...

متن کامل

An improved joint model: POS tagging and dependency parsing

Dependency parsing is a way of syntactic parsing and a natural language that automatically analyzes the dependency structure of sentences, and the input for each sentence creates a dependency graph. Part-Of-Speech (POS) tagging is a prerequisite for dependency parsing. Generally, dependency parsers do the POS tagging task along with dependency parsing in a pipeline mode. Unfortunately, in pipel...

متن کامل

The Effect of Automatic Tokenization, Vocalization, Stemming, and {POS} Tagging on {A}rabic Dependency Parsing

We use an automatic pipeline of word tokenization, stemming, POS tagging, and vocalization to perform real-world Arabic dependency parsing. In spite of the high accuracy on the modules, the very few errors in tokenization, which reaches an accuracy of 99.34%, lead to a drop of more than 10% in parsing, indicating that no high quality dependency parsing of Arabic, and possibly other morphologica...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2016

UDPipe: Trainable Pipeline for Processing CoNLL-U Files Performing Tokenization, Morphological Analysis, POS Tagging and Parsing

نویسندگان

چکیده

منابع مشابه

Tokenizing, POS Tagging, Lemmatizing and Parsing UD 2.0 with UDPipe

Prague at EPE 2017: The UDPipe System

The HIT-SCIR System for End-to-End Parsing of Universal Dependencies

An improved joint model: POS tagging and dependency parsing

The Effect of Automatic Tokenization, Vocalization, Stemming, and {POS} Tagging on {A}rabic Dependency Parsing

عنوان ژورنال:

اشتراک گذاری